When you buy a book from Amazon, you get a quote for how much it costs to ship. This is largely based on the weight of the book. If you didn't know the weight of a book, what other characteristics of it could you measure to help predict weight?
qplot(x = volume, y = weight, data = books)
qplot(x = volume, y = weight, data = books) + geom_abline(intercept = m1$coef[1], slope = m1$coef[2], col = "orchid")
m1 <- lm(weight ~ volume, data = books)
summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -190.0 -109.9   38.1  109.7  145.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.6793    88.3776    1.22     0.24    
## volume        0.7086     0.0975    7.27  6.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 13 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.787 
## F-statistic: 52.9 on 1 and 13 DF,  p-value: 6.26e-06
Q1: What is the equation for the line?
\[ \hat{y} = 107.7 + 0.708 x \]

\[ \widehat{weight} = 107.7 + 0.708 \, volume \]
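With those estimates, prediction is just arithmetic. A minimal sketch using the coefficients from the output above (the 1000 cm^3 volume is a made-up example value):

```r
# Coefficients read off the summary(m1) output
b0 <- 107.7  # intercept (grams)
b1 <- 0.708  # slope (grams per cubic centimeter)

# Predicted weight for a hypothetical 1000 cm^3 book
volume <- 1000
b0 + b1 * volume  # 815.7 grams
```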
Q2: Is volume a significant predictor?
summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -190.0 -109.9   38.1  109.7  145.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.6793    88.3776    1.22     0.24    
## volume        0.7086     0.0975    7.27  6.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 13 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.787 
## F-statistic: 52.9 on 1 and 13 DF,  p-value: 6.26e-06
Q3: How much of the variation in weight is explained by the model containing volume?
Q4: Does this appear to be a reasonable setting to apply linear regression?
We need to check: linearity, nearly normal residuals, and approximately constant variance.
qplot(x = .fitted, y = .stdresid, data = m1)
qplot(sample = .stdresid, data = m1) + geom_abline(col = "purple")
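Base R produces the same two diagnostic plots without ggplot2; a sketch using the built-in mtcars data as a stand-in, since the books data isn't bundled with R:

```r
# A stand-in model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

# which = 1: residuals vs. fitted values (linearity, constant variance)
# which = 2: normal Q-Q plot of residuals (normality)
plot(fit, which = 1:2)
```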
Multiple regression allows us to create a model that explains one numerical variable, the response, as a linear function of many explanatory variables, which can be both numerical and categorical.
We posit the true model:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon; \quad \epsilon \sim N(0, \sigma^2) \]
We use the data to estimate our fitted model:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p \]
In least-squares regression, we're still finding the estimates that minimize the sum of squared residuals.
\[ \sum_{i = 1}^n {e_i}^2 = \sum_{i = 1}^n \left(y_i - \hat{y}_i\right)^2\]
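The minimization can be checked numerically: any perturbation of the fitted coefficients increases the sum of squared residuals. A sketch on the built-in mtcars data (standing in for books):

```r
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)

# Sum of squared residuals at the least-squares estimates
ss_fit <- sum(residuals(fit)^2)

# The same sum with the slope nudged away from its optimum
ss_nudged <- sum((mtcars$mpg - (b[1] + (b[2] + 0.1) * mtcars$wt))^2)

ss_fit < ss_nudged  # TRUE: the fitted line minimizes the criterion
```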
In R:
lm(y ~ x1 + x2 + ... + xp, data = mydata)
qplot(x = volume, y = weight, color = cover, data = books)
m2 <- lm(weight ~ volume + cover, data = books)
summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -110.1  -32.3  -16.1   28.9  210.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 197.9628    59.1927    3.34  0.00584 ** 
## volume        0.7180     0.0615   11.67  6.6e-08 ***
## coverpb    -184.0473    40.4942   -4.55  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.915 
## F-statistic: 76.7 on 2 and 12 DF,  p-value: 1.45e-07
The slope corresponding to the dummy variable tells us:
how much weight is expected to change when cover goes from 0 to 1 with volume left unchanged (here, a decrease of about 184 g for paperbacks). More generally, each \(b_i\) tells you how much you expect \(y\) to change when \(x_i\) changes by one unit, holding all other variables constant.
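Under the hood, R turns the two-level cover factor into a 0/1 dummy column via the model matrix. A small illustration with made-up cover values (not the actual books data):

```r
# Hypothetical cover values; "hb" (hardback) is the reference level
cover <- factor(c("hb", "hb", "pb", "pb"))
model.matrix(~ cover)
# The coverpb column is 0 for hardbacks and 1 for paperbacks, so its
# slope shifts the intercept for paperbacks at any fixed volume.
```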
summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -110.1  -32.3  -16.1   28.9  210.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 197.9628    59.1927    3.34  0.00584 ** 
## volume        0.7180     0.0615   11.67  6.6e-08 ***
## coverpb    -184.0473    40.4942   -4.55  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.915 
## F-statistic: 76.7 on 2 and 12 DF,  p-value: 1.45e-07
summary(m2)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  197.963    59.1927    3.34 5.84e-03
## volume         0.718     0.0615   11.67 6.60e-08
## coverpb     -184.047    40.4942   -4.55 6.72e-04
qt(.025, df = nrow(books) - 3)
## [1] -2.18
Which of the following represents the appropriate 95% CI for coverpb?
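The interval follows the usual recipe, estimate ± t* × SE, with the numbers taken from summary(m2)$coef and the qt() call above:

```r
est   <- -184.047           # coverpb estimate
se    <- 40.494             # its standard error
tstar <- qt(.975, df = 12)  # same magnitude as qt(.025, df = 12)

est + c(-1, 1) * tstar * se  # roughly (-272, -96)
```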
The two cover types have different intercepts. Do they share the same slope?
m3 <- lm(weight ~ volume + cover + volume:cover, data = books)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume + cover + volume:cover, data = books)
## 
## Residuals:
##   Min    1Q Median    3Q   Max 
## -89.7 -32.1 -21.8  17.9 215.9 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    161.5865    86.5192    1.87    0.089 .  
## volume           0.7616     0.0972    7.84  7.9e-06 ***
## coverpb       -120.2141   115.6590   -1.04    0.321    
## volume:coverpb  -0.0757     0.1280   -0.59    0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.4 on 11 degrees of freedom
## Multiple R-squared:  0.93,   Adjusted R-squared:  0.911 
## F-statistic: 48.5 on 3 and 11 DF,  p-value: 1.24e-06
Do we have evidence that two types of books have different relationships between volume and weight?
This is inference, which requires valid models. We'll check diagnostics next time.
qnorm(.025)
## [1] -1.96
qt(.025, df = 13)
## [1] -2.16
qt(.025, df = 14)
## [1] -2.14
qt(.05, df = 13)
## [1] -1.77
qt(.05, df = 14)
## [1] -1.76

zagat
head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
dim(nyc)
## [1] 168 7
What is the unit of observation?
A restaurant
Let's look at the relationship between price, food rating, and decor rating.
\[ Price \sim Food + Decor \]
head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
m1 <- lm(Price ~ Food + Decor, data = nyc)
summary(m1)
## 
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -24.500      4.723   -5.19  6.2e-07 ***
## Food           1.646      0.262    6.29  2.7e-09 ***
## Decor          1.882      0.192    9.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.79 on 165 degrees of freedom
## Multiple R-squared:  0.617,  Adjusted R-squared:  0.612 
## F-statistic: 133 on 2 and 165 DF,  p-value: <2e-16
The function for \(\hat{y}\) is . . .
When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is . . .
a plane
Does the price depend on where the restaurant is located in Manhattan?
\[ Price \sim Food + Decor + East \]
head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.045  -3.881   0.039   3.392  17.756 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -24.027      4.673   -5.14  7.7e-07 ***
## Food           1.536      0.263    5.84  2.8e-08 ***
## Decor          1.909      0.190   10.05  < 2e-16 ***
## East           2.067      0.932    2.22    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.621 
## F-statistic: 92.2 on 3 and 164 DF,  p-value: <2e-16
m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.785  -3.665   0.378   3.729  17.636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.397      6.377   -4.61  8.1e-06 ***
## Food           1.663      0.282    5.90  2.1e-08 ***
## Decor          2.070      0.230    9.01  5.4e-16 ***
## East           9.662      6.218    1.55     0.12    
## Decor:East    -0.435      0.352   -1.24     0.22    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.71 on 163 degrees of freedom
## Multiple R-squared:  0.631,  Adjusted R-squared:  0.622 
## F-statistic: 69.8 on 4 and 163 DF,  p-value: <2e-16
The East term was significant in model 2, suggesting a significant relationship between location and price. Model 3 additionally allowed the slope on Decor to vary with location, and that difference in slopes was non-significant.
A given data set can conceivably have been generated from infinitely many models. Identifying the true model is like finding a piece of hay in a haystack. Said another way, the model space is massive and the criterion for what constitutes the "best" model is ill-defined.
Best strategy: Use domain knowledge to constrain the model space and/or build models that help you answer specific scientific questions.
Another common strategy:
Tread Carefully!!! The second strategy can lead to myopic analysis, overconfidence, and wrong-headed conclusions.
While we'd like to find the "true" model, in practice we just hope we're doing a good job at:
How smooth should our model be?
m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))
m3 <- lm(y ~ x + I(x^2) + I(x^3))
m4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
We can add polynomial terms to account for non-linear trends.
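A self-contained illustration on simulated data (the x and y here are made up; the slides' own data aren't shown). The I() wrapper protects ^ inside a formula, so the term enters as a genuine power:

```r
set.seed(1)
x <- runif(50, -2, 2)
y <- 1 + x - 2 * x^2 + rnorm(50, sd = 0.5)  # a truly quadratic trend

m_lin  <- lm(y ~ x)           # misses the curvature
m_quad <- lm(y ~ x + I(x^2))  # captures it

c(summary(m_lin)$r.squared, summary(m_quad)$r.squared)
```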
One way to quantify the explanatory power of a model.
\[ R^2 = \frac{SS_{reg}}{SS_{tot}} \]
This captures the proportion of variability in the \(y\) explained by our regression model.
c(summary(m1)$r.squared, summary(m2)$r.squared, summary(m3)$r.squared, summary(m4)$r.squared)
## [1] 0.439 0.919 0.992 0.994
The observed data is best explained by the quartic model. So that's the best model, right?
mBEST <- lm(y ~ poly(x, 20))
c(summary(m1)$r.squared, summary(m2)$r.squared, summary(m3)$r.squared,
  summary(m4)$r.squared, summary(mBEST)$r.squared)
## [1] 0.439 0.919 0.992 0.994 0.997
But surely that's not the best model…
A measure of the explanatory power of a model:
\[ R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} \]
But, like the likelihood, it can only increase as predictors are added, so we add a penalty for model size.
\[ R^2_{adj} = 1 - \frac{SS_{res}/(n - (p + 1))}{SS_{tot}/(n - 1)} \]
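Since \(SS_{res}/SS_{tot} = 1 - R^2\), the adjusted version can be computed from \(R^2\), \(n\), and \(p\) alone. A sketch (the helper name adj_r2 is ours, not from the slides):

```r
# Adjusted R^2 from plain R^2, n observations, and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - (p + 1))
}

# The penalty always pulls the value down when p > 0:
adj_r2(0.90, n = 100, p = 5)  # a bit below 0.90
```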
summary(mBEST)$r.squared
## [1] 0.997
summary(mBEST)$adj.r.squared
## [1] 0.994
poverty <- read.delim("poverty.txt", header = TRUE)
m1 <- lm(Poverty ~ Graduates, data = poverty)
summary(m1)
## 
## Call:
## lm(formula = Poverty ~ Graduates, data = poverty)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.162 -1.259 -0.218  0.961  5.444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   64.781      6.803    9.52  9.9e-13 ***
## Graduates     -0.621      0.079   -7.86  3.1e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.08 on 49 degrees of freedom
## Multiple R-squared:  0.558,  Adjusted R-squared:  0.549 
## F-statistic: 61.8 on 1 and 49 DF,  p-value: 3.11e-10
poverty$Noise <- rnorm(nrow(poverty))
m2 <- lm(Poverty ~ Graduates + Noise, data = poverty)
summary(m2)
## 
## Call:
## lm(formula = Poverty ~ Graduates + Noise, data = poverty)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.287 -1.314 -0.121  0.971  5.406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  65.2547     6.8260    9.56  1.1e-12 ***
## Graduates    -0.6259     0.0792   -7.90  3.1e-10 ***
## Noise         0.3165     0.3299    0.96     0.34    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.08 on 48 degrees of freedom
## Multiple R-squared:  0.566,  Adjusted R-squared:  0.548 
## F-statistic: 31.3 on 2 and 48 DF,  p-value: 1.98e-09
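The same effect can be reproduced without the poverty file: nested models guarantee that \(R^2\) never drops when predictors are added, even pure noise. A simulated sketch (all values made up):

```r
set.seed(42)
n <- 60
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
noise <- matrix(rnorm(n * 5), ncol = 5)  # five junk predictors

m_small <- lm(y ~ x)
m_big   <- lm(y ~ x + noise)

# R^2 can only rise; adjusted R^2 is free to fall
summary(m_big)$r.squared >= summary(m_small)$r.squared  # TRUE
```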
head(email)
##   spam to_multiple from cc sent_email                time image attach dollar winner inherit viagra
## 1    0           0    1  0          0 2011-12-31 22:16:41     0      0      0     no       0      0
## 2    0           0    1  0          0 2011-12-31 23:03:59     0      0      0     no       0      0
## 3    0           0    1  0          0 2012-01-01 08:00:32     0      0      4     no       1      0
## 4    0           0    1  0          0 2012-01-01 01:09:49     0      0      0     no       0      0
## 5    0           0    1  0          0 2012-01-01 02:00:01     0      0      0     no       0      0
## 6    0           0    1  0          0 2012-01-01 02:04:46     0      0      0     no       0      0
##   password num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess number
## 1        0    11.37         202      1       0            0           0            0    big
## 2        0    10.50         202      1       0            0           0            1  small
## 3        0     7.77         192      1       0            0           0            6  small
## 4        0    13.26         255      1       0            0           0           48  small
## 5        2     1.23          29      0       0            0           0            1   none
## 6        2     1.09          25      0       0            0           0            1   none
# where did this data come from / how was it collected?
Predicting spam or not using the presence of "winner"
If "winner" then "spam"?
Predicting spam or not using number of characters (in K)
Predicting spam or not using log number of characters (in K)
If log(num_char) < 1, then "spam"?
Each simple filter can be thought of as a regression model.
\(spam \sim winner; \quad Y \sim X_1\)
\(spam \sim log(num\_char); \quad Y \sim X_2\)
Each one by itself has poor predictive power, so how can we combine them into a single stronger model?
\[spam \sim log(num\_char)\]
m1 <- glm(spam ~ log(num_char), data = email, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = spam ~ log(num_char), family = "binomial", data = email)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.815  -0.467  -0.335  -0.255   3.013  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.7244     0.0606   -28.4   <2e-16 ***
## log(num_char)  -0.5435     0.0365   -14.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2190.3  on 3919  degrees of freedom
## AIC: 2194
## 
## Number of Fisher Scoring iterations: 5
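The coefficients live on the log-odds scale; plogis() maps the linear predictor back to a probability. A sketch using the estimates above (num_char is in thousands of characters, per the earlier slides):

```r
# Estimates from summary(m1), on the logit scale
b0 <- -1.7244
b1 <- -0.5435

# Predicted probability of spam for an email with num_char = 1,
# i.e. log(num_char) = 0:
plogis(b0 + b1 * log(1))  # about 0.15
```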
m2 <- glm(spam ~ log(num_char) + to_multiple + attach + dollar + inherit +
viagra, data = email, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = spam ~ log(num_char) + to_multiple + attach + dollar + 
##     inherit + viagra, family = "binomial", data = email)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.931  -0.455  -0.328  -0.236   3.034  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.59642    0.06443  -24.78  < 2e-16 ***
## log(num_char) -0.54869    0.03831  -14.32  < 2e-16 ***
## to_multiple   -1.92889    0.30493   -6.33  2.5e-10 ***
## attach         0.19970    0.06552    3.05   0.0023 ** 
## dollar        -0.00456    0.01898   -0.24   0.8102    
## inherit        0.40003    0.15166    2.64   0.0083 ** 
## viagra         1.73607   40.59296    0.04   0.9659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2106.3  on 3914  degrees of freedom
## AIC: 2120
## 
## Number of Fisher Scoring iterations: 11
A GLM consists of three things: a probability distribution for the response, a linear predictor, and a link function connecting the two. The familiar cases:
MLR : Normal distribution, identity link function
Logistic Regression : Binomial distribution, logit link function
Poisson Regression : Poisson distribution, logarithm link function
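To round out the table, a minimal Poisson-regression sketch on simulated count data (all values made up); family = poisson uses the log link by default:

```r
set.seed(7)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # log-mean linear in x

m_pois <- glm(y ~ x, family = poisson)
coef(m_pois)  # estimates should land near the true (0.5, 1.2)
```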